
[AMD] Fuse RMSNorm + FP8 per-token quant for GLM-4.7-FP8#21403

Merged
HaiShaw merged 5 commits into sgl-project:main from Jacob0226:jacob/fused_rmsnorm_quant
Apr 11, 2026

Conversation

@Jacob0226
Contributor

@Jacob0226 Jacob0226 commented Mar 25, 2026

🤖 This PR was developed with Claude Code (Claude Opus 4.6)

Summary

  • Fuse add_rmsnorm_quant_kernel (RMSNorm) with dynamic_per_token_scaled_quant_kernel (FP8 quantization) into a single kernel call using aiter's add_rmsnorm_quant with FUSE_QUANT=true, eliminating redundant global memory round-trips
  • Auto-detect CompressedTensorsW8A8Fp8 with per-channel weight quantization (e.g. GLM-4.7-FP8) and enable fused path via quant_format="fp8_per_token"
  • Fix `"fp8" in quant_format` → `quant_format == "fp8"` to prevent `fp8_per_token` from being intercepted by the existing `fused_rms_fp8_group_quant` path
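The substring pitfall behind that last fix is easy to reproduce. Below is a minimal, self-contained sketch; the dispatcher names are illustrative, not the actual communicator.py code:

```python
# Minimal repro of the dispatch bug (illustrative names, not the real code):
# substring matching on quant_format also catches the new "fp8_per_token" value.
def dispatch_buggy(quant_format: str) -> str:
    if "fp8" in quant_format:  # too broad: "fp8" is a substring of "fp8_per_token"
        return "fused_rms_fp8_group_quant"
    return "default"

def dispatch_fixed(quant_format: str) -> str:
    if quant_format == "fp8":  # exact match: group-quant path only
        return "fused_rms_fp8_group_quant"
    if quant_format == "fp8_per_token":  # new fused per-token path
        return "add_rmsnorm_quant"
    return "default"
```

With the substring check, `dispatch_buggy("fp8_per_token")` lands in the group-quant branch; with exact matching it reaches the intended per-token path.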

Changes

| File | Change |
| --- | --- |
| communicator.py | Add `fp8_per_token` path in `prepare_attn` using aiter `add_rmsnorm_quant` (`group_size=0` for per-token) |
| glm4_moe.py | Auto-detect FP8 per-token quant scheme on `qkv_proj`, pass `quant_format` to `prepare_attn`, handle `(fp8, scale)` tuple |
| fp8_utils.py | Handle pre-quantized `(fp8, scale)` tuple input in `apply_fp8_ptpc_linear` to skip redundant `per_token_quant_hip` |
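For intuition, here is a NumPy reference for what the fused path computes in one pass (residual add → RMSNorm → dynamic per-token FP8 quantization). This is a sketch of the math only, not aiter's implementation; the real kernel casts to an FP8 format on device, which the clip below only approximates:

```python
import numpy as np

FP8_MAX = 448.0  # assumed max magnitude of the FP8 (e4m3) activation format

def add_rmsnorm_per_token_quant_ref(x, residual, weight, eps=1e-6):
    """Reference semantics of the fused path: residual add, RMSNorm,
    then one dynamic scale per token (row) for FP8 quantization."""
    h = x + residual                                              # fused residual add
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    normed = (h / rms) * weight                                   # RMSNorm
    scale = np.abs(normed).max(axis=-1, keepdims=True) / FP8_MAX  # per-token scale
    q = np.clip(normed / scale, -FP8_MAX, FP8_MAX)                # values in FP8 range
    return q, scale, h  # quantized activations, scales, updated residual
```

The unfused baseline writes `normed` to global memory and reads it back in a second quantization kernel; fusing the two removes that round-trip.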

Scope

  • Fuses RMSNorm + FP8 per-token quant into a single kernel in prepare_attn, applied across all 92 decoder layers

Test plan

Accuracy (GSM8K on MI355X TP8 with GLM-4.7-FP8):

| | Before | After |
| --- | --- | --- |
| GSM8K | 0.948 | 0.943 |

Within margin of error.

Performance (InferenceMax config on MI355X TP8):

| Config | ITL Decode Speedup |
| --- | --- |
| ISL 1K/8K, OSL 1K, concurrency 4/8/16/32/64 | ~+1% |

Profiling: the right side of the figure shows `dynamic_per_token_scaled_quant_kernel` fused into `add_rmsnorm_quant_kernel`.
(profiling screenshot)

🤖 Generated with Claude Code

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the performance of FP8 per-token quantization by fusing RMSNorm and quantization operations into a single kernel. This change reduces memory overhead and improves efficiency, particularly for models like GLM-4.7-FP8. The update also includes robust detection mechanisms for the appropriate quantization schemes and ensures correct routing to the optimized kernels, leading to 95 kernel fusions across attention and dense MLP layers.

Highlights

  • Kernel Fusion: Fused the add_rmsnorm_quant_kernel (RMSNorm) with dynamic_per_token_scaled_quant_kernel (FP8 quantization) into a single kernel call using aiter's add_rmsnorm_quant with FUSE_QUANT=true, which eliminates redundant global memory round-trips.
  • Automatic Detection: Implemented auto-detection for CompressedTensorsW8A8Fp8 with per-channel weight quantization (e.g., GLM-4.7-FP8) to enable the fused path via quant_format="fp8_per_token".
  • Quantization Format Handling: Fixed the logic for detecting the 'fp8' quantization format from "fp8" in quant_format to quant_format == "fp8" to prevent fp8_per_token from being incorrectly intercepted by the existing fused_rms_fp8_group_quant path.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces FP8 per-token quantization support using the aiter library for RMSNorm and linear layers within the SGLang framework, specifically for attention and MLP blocks. Key changes include adding new aiter RMSNorm quantization functions and logic in communicator.py to handle the fp8_per_token format, returning quantized tensors and scales as a tuple. The fp8_utils.py file is updated to process this (tensor, scale) tuple input for FP8 linear operations. Additionally, glm4_moe.py is modified to dynamically detect and pass the appropriate quantization format to the communicator layers, and its forward methods are adapted to correctly handle tuple inputs for hidden states.

Review comments suggest:

  • improving consistency in quant_format string matching,
  • addressing a potential bug where the scale tensor might be discarded in GLM4MoEBlock.forward_prepare,
  • refactoring duplicated code in communicator.py,
  • restoring missing type hints in apply_fp8_ptpc_linear, and
  • investigating a potential dead code block in GLM4MoEBlock.forward related to tuple handling for MoE layers.
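The (tensor, scale) tuple plumbing described in the review can be sketched as follows. This is a hypothetical skeleton, not the actual fp8_utils.py code; the quantize and GEMM callables are injected as parameters precisely because the real aiter entry points are not shown in this thread:

```python
# Hypothetical skeleton of the tuple handling in apply_fp8_ptpc_linear:
# accept either a plain activation tensor (quantize here) or a pre-quantized
# (fp8, scale) pair from the fused RMSNorm kernel (skip re-quantization).
def apply_fp8_ptpc_linear_sketch(x, weight, weight_scale, quantize_per_token, fp8_gemm):
    if isinstance(x, tuple):
        x_fp8, x_scale = x  # already quantized upstream; skip the extra kernel
    else:
        x_fp8, x_scale = quantize_per_token(x)
    return fp8_gemm(x_fp8, weight, x_scale, weight_scale)
```

The design point is that the caller's type (tensor vs. tuple) carries the "already quantized" signal, so no extra flag has to be threaded through the layer stack.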

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


@Jacob0226 Jacob0226 force-pushed the jacob/fused_rmsnorm_quant branch 3 times, most recently from e784322 to 61b4a63 on March 27, 2026 04:37
@Jacob0226 Jacob0226 marked this pull request as ready for review March 27, 2026 04:40

Fuse `add_rmsnorm_quant_kernel` (RMSNorm) with
`dynamic_per_token_scaled_quant_kernel` (FP8 quantization) into a single
kernel call using aiter's `add_rmsnorm_quant` with FUSE_QUANT=true.

This eliminates redundant global memory round-trips between RMSNorm output
and FP8 quantization input for models using CompressedTensorsW8A8Fp8 with
per-channel weight quantization (e.g. GLM-4.7-FP8).

Changes:
- communicator.py: Add fp8_per_token path in prepare_attn using aiter
  add_rmsnorm_quant (group_size=0 for per-token)
- glm4_moe.py: Auto-detect FP8 per-token quant scheme on qkv_proj,
  pass quant_format to prepare_attn, handle (fp8, scale) tuple
- fp8_utils.py: Handle pre-quantized (fp8, scale) tuple input in
  apply_fp8_ptpc_linear to skip redundant per_token_quant_hip
- Fix "fp8" in quant_format -> quant_format == "fp8" to prevent
  fp8_per_token from being intercepted by the fp8 group-quant path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Jacob0226 Jacob0226 force-pushed the jacob/fused_rmsnorm_quant branch from 61b4a63 to 0845ef2 on March 30, 2026 01:47
@HaiShaw
Collaborator

HaiShaw commented Mar 31, 2026

/tag-and-rerun-ci

Collaborator

@HaiShaw HaiShaw left a comment


@Jacob0226 lint fix pls

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Jacob0226
Contributor Author

Fixed Lint.

@Jacob0226 Jacob0226 requested a review from HaiShaw April 1, 2026 01:53
@HaiShaw
Collaborator

HaiShaw commented Apr 1, 2026

@amd-bot ci-status

@amd-bot

amd-bot commented Apr 1, 2026

@HaiShaw

CI Status for PR #21403

PR: [AMD] Fuse RMSNorm + FP8 per-token quant for GLM-4.7-FP8
Changed files: communicator.py (+86/-14), fp8_utils.py (+13/-1), glm4_moe.py (+47/-2)

| Job | Error | Related? | Explanation |
| --- | --- | --- | --- |
| stage-c-test-large-8-gpu-amd-mi35x (0) | RuntimeError: invalid argument for batch_prefill | 🟢 Unlikely | Crash in aiter's mha_batch_prefill kernel running Qwen3-Coder-Next — completely different code path |
| stage-a-test-1-gpu-small | TimeoutError: Server failed to start | 🟢 Unlikely | Corrupted HuggingFace model cache on runner 5090-b-runner-4 — infrastructure issue |
| build-and-test | RuntimeError: Non-base64 digit found | 🟢 Unlikely | XPU test test_deepseek_ocr.py passes relative path as image_data — pre-existing test bug |
| pr-gate / pr-gate | CI rate limit exceeded | 🟢 Unlikely | User triggered CI within 120-min cooldown window |
| pr-test-finish / pr-test-amd-finish / wait-for-stage-a / finish | Upstream failures | 🟢 Unlikely | Aggregator gates that failed because upstream jobs failed |

Details

stage-c-test-large-8-gpu-amd-mi35x (partition 0): The Qwen3-Coder-Next model crashed during warmup prefill in aiter_backend.py:2365 (mha_batch_prefill_func) with FP8 KV cache on MI35X hardware. The PR does not touch aiter_backend.py, qwen3_next.py, or any code in the attention forward path. The Qwen3Next model never passes a quant_format to prepare_attn (it always uses the default ""), so none of the PR's new branches in communicator.py are exercised. The condition change from "fp8" in quant_format to quant_format == "fp8" is also a no-op for all existing callers (DeepSeek passes exactly "fp8", "mxfp4", or "").

stage-a-test-1-gpu-small: The test_srt_backend.py test failed because the HuggingFace cache for meta-llama/Llama-3.1-8B-Instruct on runner 5090-b-runner-4 had a corrupted .incomplete file. Cache cleanup hit a race condition (ENOENT), the model couldn't be downloaded in time, and the server startup timed out after 300s. Pure infrastructure issue.

build-and-test (XPU): The test_deepseek_ocr.py test passes "../../examples/assets/example_image.png" (a relative path) as image_data. The server's get_image_bytes() doesn't recognize relative paths and tries to base64-decode it, failing with Non-base64 digit found. This is a pre-existing test bug unrelated to this PR.

pr-gate: Rate limiter blocked user Jacob0226 who re-triggered CI ~64 minutes after the previous run, within the 120-minute cooldown. Will resolve on its own with a re-trigger after the cooldown expires.

Summary

All 4 independent failures are unrelated to this PR. None of the failing tests exercise the code paths changed by this PR (GLM-4 MOE model's fused RMSNorm+FP8 per-token quantization on AMD). The AMD CI jobs that are relevant (MI325 1-GPU, 2-GPU, 4-GPU, 8-GPU, and MI35X partition 1) all passed.

Generated by amd-bot using Claude Code CLI

Collaborator

@HaiShaw HaiShaw left a comment


Please double check the applicable kernels scope on gfx95 and _use_aiter vs. _use_aiter only

@HaiShaw
Collaborator

HaiShaw commented Apr 10, 2026

@Jacob0226 May you fix conflict?

Keep quant_format == "fp8" (exact match) to prevent fp8_per_token
from being intercepted by the fp8 group-quant path, while preserving
the NSA bf16 passthrough logic (_nsa_needs_bf16 / 3-tuple packing)
from upstream PR sgl-project#22258.

Made-with: Cursor
@Jacob0226
Contributor Author

> @Jacob0226 May you fix conflict?

Done. Merged upstream/main and resolved the conflict in communicator.py.

The conflict was between this PR's quant_format == "fp8" (exact match to avoid intercepting fp8_per_token) and PR #22258's NSA bf16 passthrough logic. The resolution keeps both:

  • Exact match quant_format == "fp8" for the group-quant path (with NSA bf16 passthrough preserved)
  • Separate elif for quant_format == "fp8_per_token"
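The resulting branch order can be sketched as below. This is an illustrative dispatcher, not the actual communicator.py code; the NSA passthrough condition is simplified to a boolean here:

```python
# Illustrative branch order after the conflict resolution (not the real code):
# exact-match "fp8" keeps the group-quant path plus the NSA bf16 passthrough
# from PR #22258, while "fp8_per_token" gets its own separate elif.
def prepare_attn_dispatch(quant_format: str, nsa_needs_bf16: bool = False) -> str:
    if quant_format == "fp8":
        if nsa_needs_bf16:
            return "bf16_passthrough"  # NSA layers keep bf16 activations
        return "fused_rms_fp8_group_quant"
    elif quant_format == "fp8_per_token":
        return "add_rmsnorm_quant"  # aiter fused kernel, group_size=0
    return "plain_rmsnorm"
```

Because the first branch requires an exact string match, the per-token format can never fall into the group-quant path, regardless of the NSA flag.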

@HaiShaw HaiShaw merged commit 7e4e1dc into sgl-project:main Apr 11, 2026
85 of 105 checks passed
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
…#21403)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
